supplementary material
Language-Bias-Resilient Visual Question Answering via Adaptive Multi-Margin Collaborative Debiasing
Language bias in Visual Question Answering (VQA) arises when models exploit spurious statistical correlations between question templates and answers, particularly in out-of-distribution scenarios, thereby neglecting essential visual cues and compromising genuine multimodal reasoning. Despite numerous efforts to enhance the robustness of VQA models, a principled understanding of how such bias originates and influences model behavior remains underdeveloped. In this paper, we address this gap through a comprehensive empirical and theoretical analysis, revealing that modality-specific gradient imbalances, which originate from the inherent heterogeneity of multimodal data, lead to skewed feature fusion and biased classifier weights. To alleviate these issues, we propose a novel MultiMargin Collaborative Debiasing (MMCD) framework2, which adaptively integrates frequency-aware, confidence-aware, and difficulty-aware angular margins with a dynamic, difficulty-aware contrastive learning mechanism to reshape decision boundaries under biased training conditions. Extensive experiments across multiple challenging VQA benchmarks confirm the consistent superiority of our proposed MMCD over state-of-the-art baselines in combating language bias.
Hybrid Re-matching for Continual Learning with Parameter-efficient Tuning
Continual learning seeks to enable a model to assimilate knowledge from nonstationary data streams without catastrophic forgetting. Recently, methods based on Parameter-Efficient Tuning (PET) have achieved superior performance without even storing any historical exemplars, which train much fewer specific parameters for each task upon a frozen pre-trained model, and tailored parameters are retrieved to guide predictions during inference. However, reliance solely on pretrained features for parameter matching exacerbates the inconsistency between the training and inference phases, thereby constraining the overall performance. To address this issue, we propose HRM-PET, which makes full use of the richer downstream knowledge inherently contained in the trained parameters. Specifically, we introduce a hybrid re-matching mechanism, which benefits from the initial predicted distribution to facilitate the parameter selections. The direct rematching addresses misclassified samples identified with correct task identity in prediction, despite incorrect initial matching. Moreover, the confidence-based re-matching is specifically designed to handle other more challenging mismatched samples that cannot be calibrated by the former. Besides, to acquire task-invariant knowledge for better matching, we integrate a cross-task instance relationship distillation module into the PET-based method. Extensive experiments conducted on four datasets under five pre-trained settings demonstrate that HRM-PET performs favorably against the state-of-the-art methods.
Disentangled Concepts Speak Louder Than Words Explainable Video Action Recognition
Effective explanations of video action recognition models should disentangle how movements unfold over time from the surrounding spatial context. However, existing methods--based on saliency--produce entangled explanations, making it unclear whether predictions rely on motion or spatial context. Language-based approaches offer structure but often fail to explain motions due to their tacit nature--intuitively understood but difficult to verbalize. To address these challenges, we propose Disentangled Action aNd Context concept-based Explainable (DANCE) video action recognition, a framework that predicts actions through disentangled concept types: motion dynamics, objects, and scenes. We define motion dynamics concepts as human pose sequences. We employ a large language model to automatically extract object and scene concepts. Built on an ante-hoc concept bottleneck design, DANCE enforces prediction through these concepts. Experiments on four datasets--KTH, Penn Action, HAA500, and UCF101--demonstrate that DANCE significantly improves explanation clarity with competitive performance.
Appendices and Supplementary Material
A.1 Coordinate Systems and Transformation To achieve spatial synchronization between different sensors, vehicle-vehicle-UAV collaboration requires using sensor parameter information to perform coordinate system transformations. The relationships between the coordinate systems are illustrated in Fig. S 1. Figure 1: Relationship between coordinate systems. The pixel coordinate system refers to a two-dimensional coordinate system defined on the image plane, typically represented as (u,v), with units in pixels. In this system, the origin is located at the top-left corner of the image, the u-axis points to the right along the horizontal direction, and the v-axis points downward along the vertical direction. This coordinate system is used to describe the position of points on the two-dimensional image captured by the camera.
Statistical Guarantees for High-Dimensional Stochastic Gradient Descent
Stochastic Gradient Descent (SGD) and its Ruppert-Polyak averaged variant (ASGD) lie at the heart of modern large-scale learning, yet their theoretical properties in high-dimensional settings are rarely understood. In this paper, we provide rigorous statistical guarantees for constant learning-rate SGD and ASGD in high-dimensional regimes. Our key innovation is to transfer powerful tools from high-dimensional time series to online learning. Specifically, by viewing SGD as a nonlinear autoregressive process and adapting existing coupling techniques, we prove the geometric-moment contraction of high-dimensional SGD for constant learning rates, thereby establishing asymptotic stationarity of the iterates. Building on this, we derive the q-th moment convergence of SGD and ASGD for any q 2 in general โs-norms, and, in particular, the โ -norm that is frequently adopted in high-dimensional sparse or structured models. Furthermore, we provide sharp high-probability concentration analysis which entails the probabilistic bound of high-dimensional ASGD. Beyond closing a critical gap in SGD theory, our proposed framework offers a novel toolkit for analyzing a broad class of high-dimensional learning algorithms.
MineAny Build: Benchmarking Spatial Planning for Open-world AIAgents
Spatial Planning is a crucial part in the field of spatial intelligence, which requires the understanding and planning about object arrangements in space perspective. AI agents with the spatial planning ability can better adapt to various real-world applications, including robotic manipulation, automatic assembly, urban planning etc. Recent works have attempted to construct benchmarks for evaluating the spatial intelligence of Multimodal Large Language Models (MLLMs). Nevertheless, these benchmarks primarily focus on spatial reasoning based on typical Visual QuestionAnswering (VQA) forms, which suffers from the gap between abstract spatial understanding and concrete task execution. In this work, we take a step further to build a comprehensive benchmark called MineAnyBuild, aiming to evaluate the spatial planning ability of open-world AI agents in the Minecraft game. Specifically, MineAnyBuild requires an agent to generate executable architecture building plans based on the given multi-modal human instructions.
3DHuman Pose Estimation with Muscles
We introduce MusclePose as an end-to-end learnable physics-infused 3D human pose estimator that incorporates muscle-dynamics modeling to infer human dynamics from monocular video. Current physics pose estimators aim to predict physically plausible poses by enforcing the underlying dynamics equations that govern motion. Since this is an underconstrained problem without force-annotated data, methods often estimate kinetics with external physics optimizers that may not be compatible with existing learning frameworks, or are too slow for real-time inference. While more recent methods use a regression-based approach to overcome these issues, the estimated kinetics can be seen as auxiliary predictions, and may not be physically plausible. To this end, we build on existing regressionbased approaches, and aim to improve the biofidelity of kinetic inference with a multihypothesis approach -- by inferring joint torques via Lagrange's equations and via muscle dynamics modeling with muscle torque generators. Furthermore, MusclePose predicts detailed human anthropometrics based on values from biomechanics studies, in contrast to existing physics pose estimators that construct their human models with shape primitives. We show that MusclePose is competitive with existing 3D pose estimators in positional accuracy, while also able to infer plausible human kinetics and muscle signals consistent with values from biomechanics studies, without requiring an external physics engine.
World Central Banks (Supplementary Material)
Kaggle3 Publishing our dataset on Kaggle offers distinct advantages over platforms like HuggingFace, par-4 ticularly in terms of usability and community engagement. Kaggle provides integrated tools for data5 visualization, version control, and collaborative discussion, streamlining the research workflow. Unlike Hug-7 gingFace, which is primarily model-focused, Kaggle is optimized for dataset-driven experimenta-8 tion, making it a more practical platform for sharing, validating, and improving data-centric work.9 Hosting the dataset on Kaggle thus ensures greater transparency, accessibility, and impact across10 both academic and applied research communities.11 Website12 Our World Central Banks website offers a structured and accessible overview of our research. It13 features a task-specific model leaderboard with direct links to download and explore the best per-14 forming model.
1 Supplementary Material
To investigate this further, we first observe that Claude-3.7-Sonnet Figure 1 shows the average pass rate under budgets of 12,000, 10 14,000, 16,000, and 17,000 tokens. As the data demonstrate, enlarging the thinking budget yields no 11 appreciable improvement in performance. This finding underscores 14 the challenging nature of ENGDESIGN and suggests its value as a rigorous testbed for future efforts 15 to enhance LLMs' engineering design proficiency. Figure 1: Average pass rate (%) of Claude-3.7-Thinking
MVU-Eval: Towards Multi-Video Understanding Evaluation for Multimodal LLMs (Supplementary Material)
In this section, we introduce the construction pipeline for generating MVU-Eval QA pairs based on2 each data source.3 These questions include: (1) Object Recognition, (2)8 Spatial Understanding, (3) Counting, (4) Knowledge-intensive Reasoning, and (5) Temporal9 Reasoning. These generated questions, answers, and candidate choices are manually checked by10 humans. Pipelines for constructing video pairs are slightly different across datasets.11 By default, 2-6 videos are randomly sampled, regardless of their labels.